CSCR010: Second Year Report

Authors

  • Yanbo J. Wang
  • Paul Leng
Abstract

My PhD research focuses on Text Mining, a major research area in Knowledge Discovery in Databases (KDD), and in particular on Text Pre-processing (TPP) for the classification / categorization of documents, utilizing novel algorithms for the identification of hidden patterns, rules, regularities and trends within these documents. Significant techniques from Data Mining, another well-known KDD research area, are used to support this research, especially when dealing with a very large documentbase; these include Association Rule Mining (ARM), Classification Rule Mining (CRM), etc. This report covers the following: (a) a brief summary of the work done in my first year of PhD research; (b) a clear statement of my PhD research topic, which focuses on finding novel approaches to pre-processing a documentbase in Text Classification; (c) a survey of existing TPP approaches; (d) a description, with experiments, of three TPP approaches that I proposed in my second year of PhD research; and (e) a year-end conclusion, with a list of TPP approaches that may be investigated in my ongoing PhD research.

1 SUMMARY OF THE FIRST YEAR REPORT

In my first year of PhD research:

  • I categorized a number of common serial Association Rule Mining (ARM) algorithms into three distinct “families”: (a) generating Frequent Itemsets (FIs), (b) generating Maximal Frequent Itemsets (MFIs), and (c) generating Frequent Closed Itemsets (FCIs). I also identified four essential issues in ARM: (i) horizontal data layout vs. vertical data layout, (ii) bottom-up mining approach vs. top-down mining approach, (iii) breadth-first search strategy vs. depth-first search strategy, and (iv) set-enumeration-tree structure vs. other data structures.
  • I pointed out that ARM techniques can be applied to six different Text Mining tasks: (a) association extraction, (b) prototypical document extraction, (c) classification knowledge extraction, (d) descriptive representation of documents, (e) meaning discovery of ambiguous words, and (f) annotated-text-based generalized sequences extraction.
  • I collected and described five existing approaches in Text Pre-processing (TPP): (a) full text approach, (b) keyword / indexed-data approach, (c) prototypical document approach, (d) multi-term text phrases approach, and (e) concept approach.
  • I separately summarized various techniques and methods in Classification Rule Mining (CRM) and Prediction Rule Mining (PRM), with an emphasis in the CRM summary on ARM-based CRM techniques (i.e., “Classification Based on Associations (CBA)”, “Classification based on Multiple Association Rules (CMAR)”, “Classification based on Predictive Association Rules (CPAR)”, etc.).
  • I completed a report on combinations of different Natural Language Processing (NLP) techniques in the TPP stage, including Part-of-Speech (POS) tagging, multi-word term generation, single / multi-word term selection, etc.

2 MY PhD RESEARCH TOPIC

With the significant increase in the number of electronic documents, especially web documents, Text Classification has become an increasingly popular research field in Knowledge Discovery in Databases (KDD) and Machine Learning (ML). The distinction is that KDD-based research focuses on statistical techniques, while ML-based research focuses on artificial-intelligence techniques. Consequently, there are a variety of methods in both KDD- and ML-based Text Classification. It has been established that, in either case, Text Mining requires the documentbase to be pre-processed so that it is in an appropriate format for the application of KDD- and/or ML-based Text Classification techniques.
The nature of the pre-processing can be characterised as either (i) document representation or (ii) feature selection. In (i), the “bag of words” approach is common, sometimes incorporating local word frequency and local word position within the given documentbase. Further, we can identify five different forms of the “bag of words” approach: (a) bag of single-words [1], (b) bag of keywords [19] [21], (c) bag of prototypical-phrases (bag of key-phrases) [19], (d) bag of multiword-phrases [2], and (e) bag of concepts [20]. In (ii), common methods include: stop-list [13], stemming [16] [17], frequency weighting [12], latent semantic indexing [3], log likelihood ratio [14], odds ratio [18], POS tagging [5] [6] [11], Chi-square statistics [23], etc. Therefore, the topic of my PhD research focuses on finding novel approaches to pre-processing a documentbase in Text Classification.

3 BACKGROUND OF TEXT PRE-PROCESSING (TPP)

In my first year of PhD research, I carried out an initial survey outlining five existing approaches in Text Pre-processing (TPP). Taking into account some specific considerations in Text Mining and significant aspects of textual data, these five existing TPP approaches can be revisited as follows.

3.1 Bag of Single-words Approach

The most common form of the bag of words approach [4] is a collection of single-words with no additional information regarding frequency. The main advantage of the approach is its simplicity. However, the bag of words may be very large, and consequently an overwhelming number of rules may be generated, many of them uninteresting. In [1], Ahonen-Myka et al. go some way towards addressing this problem by pruning the bag of words with reference to a stop list.

Definition 1. Bag of Single-words approach: Let T = {T1, T2, ..., Tm} be a collection of documents. Let I = {I1, I2, ..., In} be the set of all single-words appearing in T. Each document Ti is a subset of I.

3.2 Bag of
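To make Definition 1 (Section 3.1) concrete, the following is a minimal Python sketch of the bag of single-words representation with stop-list pruning. It is an illustration only, not the implementation used in this report or in [1]: the function names, the tiny stop list, the lower-casing, the letters-only tokenizer and the sample documents are all assumptions made for the example.

```python
import re

# Hypothetical stop list, for illustration only; a real one would be far larger.
STOP_LIST = frozenset({"a", "an", "and", "the", "of", "in", "is", "to"})


def tokenize(text):
    """Lower-case the text and split it into single words (letters only)."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_single_words(documents, stop_list=frozenset()):
    """Build the representation of Definition 1.

    Returns (T, I), where T is a list holding one word set per document
    and I is the set of all single-words appearing in T. Words found in
    stop_list are pruned, and no frequency information is kept.
    """
    T = [frozenset(tokenize(doc)) - stop_list for doc in documents]
    I = frozenset().union(*T)
    return T, I


# Hypothetical two-document collection.
docs = [
    "The mining of text is fun",
    "Text classification and the mining of rules",
]
T, I = bag_of_single_words(docs, STOP_LIST)

print(sorted(T[0]))  # ['fun', 'mining', 'text']
print(sorted(I))     # ['classification', 'fun', 'mining', 'rules', 'text']
```

Representing each document as a set, rather than a list or a counter, mirrors the statement that each Ti is a subset of I and that no frequency information is retained. Passing an empty stop list shows how quickly I grows when no pruning is applied.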

Publication date: 2005